Requirements Engineering: focus on Natural Language Processing, Lecture 2 (alessio_ferrari)
In this lecture, we give a practical guide on how to detect ambiguities in natural language requirements by means of GATE and by means of Python. A brief guide to Python is also included.
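A minimal Python sketch of the lexical side of such detection, flagging vague terms from a small illustrative checklist (the word list and function name are assumptions for this sketch, not the lecture's actual GATE/Python rules):

```python
import re

# Illustrative checklist of vague terms often flagged in requirements;
# real ambiguity detection uses much richer lexicons and grammar rules.
VAGUE_TERMS = ["appropriate", "adequate", "as possible", "efficient",
               "fast", "flexible", "user-friendly"]

def find_ambiguities(requirement):
    """Return the vague terms found in one requirement sentence."""
    text = requirement.lower()
    return [t for t in VAGUE_TERMS
            if re.search(r"\b" + re.escape(t) + r"\b", text)]

req = "The system shall respond as fast as possible with an appropriate message."
print(find_ambiguities(req))  # ['appropriate', 'as possible', 'fast']
```

In GATE, the same idea would be expressed as a gazetteer of vague terms plus JAPE annotation rules; the Python version is just the simplest possible stand-in.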
The previous lecture gives an introduction to the problem of ambiguity in requirements engineering. Find it here: https://www.slideshare.net/alessio_ferrari/requirements-engineering-focus-on-natural-language-processing-lecture-1
"Bilingual Terminology Extraction from TMX. A state-of-the-art overview." Presentation at Translating Europe Forum 2016. Focus on translation technology.
Datalog+/- Track Introduction & Reasoning on UML Class Diagrams via Datalog+/- (RuleML)
UML class diagrams (UCDs) are a widely adopted formalism
for modeling the intensional structure of a software system. Although
UCDs typically guide the implementation of a system, it is common
in practice that developers need to recover the class diagram from an
implemented system. This process is known as reverse engineering. A
fundamental property of reverse engineered (or simply re-engineered)
UCDs is consistency, showing that the system is realizable in practice.
In this work, we investigate the consistency of re-engineered UCDs, and
we show that it is PSPACE-complete. The upper bound is obtained by exploiting
algorithmic techniques developed for conjunctive query answering under
guarded Datalog+/-, a key member of the Datalog+/- family
of KR languages, while the lower bound is obtained by simulating the
behavior of a polynomial space Turing machine.
RuleML2015: Rule Generalization Strategies in Incremental Learning of Disjunc... (RuleML)
Symbolic Machine Learning systems and applications, especially
when applied to real-world domains, must face the problem of
concepts that cannot be captured by a single definition, but require several
alternate definitions, each of which covers part of the full concept
extension. This problem is particularly relevant for incremental systems,
where progressive covering approaches are not applicable, and the learning
and refinement of the various definitions is interleaved during the
learning phase. In these systems, the learned model depends not only
on the order in which the examples are provided, but also on
the choice of the specific definition to be refined. This paper proposes
different strategies for determining the order in which the alternate definitions
of a concept should be considered in a generalization step, and
evaluates their performance on a real-world domain dataset.
Rulelog is in the process of industry standardization via RuleML and W3C:
RIF-Rulelog specification, version of May 24, 2013, Michael Kifer, ed. RIF-Rulelog is a powerful dialect of the W3C Rule Interchange Format (RIF) that is in draft as a submission from RuleML to W3C.
Several industry standards in these areas are based heavily on our team's contributions to authoring and editing the specifications, conducting the underlying research, and designing earlier-phase standards. These include, most notably, the two most important industry standards on rule knowledge:
W3C Rule Interchange Format (RIF), which is primarily based on the RuleML standards design (semantic web rules)
W3C OWL 2 RL Profile (rule-based web ontologies)
The team has also contributed to the development of W3C SPARQL and ISO Common Logic, and has been strongly involved in other related standardization efforts at OMG and OASIS.
RuleML2015: Explanation of proofs of regulatory (non-)compliance using semanti... (RuleML)
With recent regulatory advances, modern enterprises must not only comply with regulations but also be prepared to provide explanations of proofs of (non-)compliance. On top of compliance checking, this necessitates modeling concepts from regulations and enterprise operations so that stakeholder-specific, close-to-natural-language explanations can be generated. We take a step in this direction by using the Semantics of Business Vocabulary and Rules (SBVR) to model and map the vocabularies of regulations and enterprise operations. Using these vocabularies and leveraging the proof-generation abilities of an existing compliance engine, we show how such explanations can be created. The basic natural language explanations that we generate can easily be enriched by adding the requisite domain knowledge to the vocabularies.
Knowledge-Based Reasoning: Agents, Facets of Knowledge. Logic and Inferences: Formal Logic,
Propositional and First-Order Logic, Resolution in Propositional and First-Order Logic, Deductive
Retrieval, Backward Chaining, Second-Order Logic. Knowledge Representation: Conceptual
Dependency, Frames, Semantic Nets.
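The backward-chaining item in the outline above can be illustrated with a minimal sketch over propositional Horn rules (the rule base and atoms are invented for illustration):

```python
# Minimal backward-chaining sketch over propositional Horn rules.
RULES = {           # head -> list of alternative rule bodies
    "mortal": [["human"]],
    "human": [["greek"]],
}
FACTS = {"greek"}

def prove(goal, seen=frozenset()):
    """True if goal follows from FACTS via backward chaining on RULES."""
    if goal in FACTS:
        return True
    if goal in seen:                      # guard against cyclic rules
        return False
    return any(all(prove(sub, seen | {goal}) for sub in body)
               for body in RULES.get(goal, []))

print(prove("mortal"))  # True: greek -> human -> mortal
```

Full first-order backward chaining additionally needs unification of variables; this propositional version shows only the goal-directed search structure.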
Grapheme-To-Phoneme Tools for the Marathi Speech Synthesis (IJERA Editor)
We describe in detail a Grapheme-to-Phoneme (G2P) converter required for the development of a good quality
Marathi Text-to-Speech (TTS) system. The Festival and Festvox framework is chosen for developing the
Marathi TTS system. Since Festival does not provide complete language processing support specific to various
languages, it needs to be augmented to facilitate the development of TTS systems in certain new languages.
Because of this, a generic G2P converter has been developed. In the customized Marathi G2P converter, we
have handled schwa deletion and compound word extraction. In the experiments carried out to test the Marathi
G2P on a text segment of 2485 words, 91.47% word phonetisation accuracy is obtained. This Marathi G2P has
been used for phonetising large text corpora which in turn is used in designing an inventory of phonetically rich
sentences. The sentences ensured a good coverage of the phonetically valid di-phones using only 1.3% of the
complete text corpora.
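As a rough illustration of how a rule-based G2P with word-final schwa deletion might look (the grapheme map and the deletion rule are simplified toy assumptions, not the actual Marathi rules used by the system):

```python
# Toy rule-based grapheme-to-phoneme converter: greedy longest-match
# lookup over a tiny illustrative grapheme map, followed by a single
# simplified schwa-deletion rule at the end of the word.
G2P_MAP = {"k": "k", "m": "m", "l": "l", "a": "ə", "aa": "aː"}

def g2p(word):
    """Map a romanized word to a phone list, then drop a final schwa."""
    phones, i = [], 0
    while i < len(word):
        if word[i:i + 2] in G2P_MAP:          # prefer two-letter graphemes
            phones.append(G2P_MAP[word[i:i + 2]])
            i += 2
        elif word[i] in G2P_MAP:
            phones.append(G2P_MAP[word[i]])
            i += 1
        else:
            i += 1                            # skip unknown graphemes
    if phones and phones[-1] == "ə":          # word-final schwa deletion
        phones.pop()
    return phones

print(g2p("kamala"))  # ['k', 'ə', 'm', 'ə', 'l']
```

Real Marathi schwa deletion is context-sensitive (it also applies word-medially under certain syllable conditions), which is why the paper handles it with dedicated rules rather than a single end-of-word check.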
Introduction to Ontology Engineering with Fluent Editor 2014 (Cognitum)
An introductory course on Ontology Engineering using Controlled Natural Language. Fluent Editor (FE) is an ontology editor, a tool for editing and manipulating ontologies. Its main feature is that it uses controlled natural language (CNL) to communicate with the user. For human users, communicating in CNL is a more suitable alternative to XML-based OWL editors.
Latent Topic-semantic Indexing based Automatic Text Summarization (Elaheh Barati)
Automatic summarization, a difficult but pressing problem in natural language processing, aims at shortening source documents while retaining their main information. In recent years, more statistical machine learning methods have been applied to automatic summarization. In this paper, we propose a novel approach to summarization based on a hierarchical Bayesian model of topic-semantic indexing (TSI) and an extraction strategy of average log-likelihood. The new method is tested on the Brown corpus, and its performance is analyzed through a well-designed blind experiment with one-way ANOVA on human reviews. The experimental results show that the TSI model is promising for topic-driven summarization.
In this talk I intend to review some basic and high-level concepts: formal languages, grammars, and ontologies. Languages transmit knowledge from a sender to a receiver; grammars formally specify languages; ontologies are formal specifications of specific knowledge domains. After this introductory revision, highlighting the role of each of those elements in the context of computer-based problem solving (programming), I will talk about a project aimed at automatically inferring and generating a grammar for a Domain Specific Language (DSL) from a given ontology that describes the specific domain. The transformation rules will be presented, and the system, Onto2Gra, that fully implements this "Ontological approach for DSL development" will be introduced.
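A toy sketch of the ontology-to-grammar transformation idea (the mini ontology and the shapes of the generated rules are invented for illustration; Onto2Gra's actual transformation rules differ):

```python
# Toy ontology-to-grammar transformation: each class becomes a nonterminal
# with a terminal rule, and each object property becomes a production
# linking its domain and range classes.
ONTOLOGY = {
    "classes": ["Library", "Book"],
    "properties": [("Library", "holds", "Book")],  # (domain, name, range)
}

def ontology_to_grammar(onto):
    rules = [f"{c} -> '{c.lower()}'" for c in onto["classes"]]
    for dom, prop, rng in onto["properties"]:
        rules.append(f"{dom} -> {dom} '{prop}' {rng}")
    return rules

for rule in ontology_to_grammar(ONTOLOGY):
    print(rule)
```

Even this toy version shows the core design choice: the grammar's nonterminals and productions are derived mechanically from the ontology's classes and properties, so the DSL's vocabulary stays aligned with the domain model by construction.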
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ... (Lifeng (Aaron) Han)
Presentation slides from MT Summit 2013.
Language-independent Model for Machine Translation Evaluation with Reinforced Factors
International Association for Machine Translation, 2013
Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Samuel Zeng
Proceedings of the 14th biennial Machine Translation Summit (MT Summit 2013). Nice, France. 2-6 September 2013. Open tool: https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
These slides describe the basic concepts of machine translation, present MT challenges, and briefly describe rule-based and statistical MT. Some notes about evaluation are included as well.
Unsupervised Software-Specific Morphological Forms Inference from Informal Di... (Chunyang Chen)
The paper was accepted at ICSE'17 and TSE'19. https://se-thesaurus.appspot.com/ https://pypi.org/project/DomainThesaurus/ Informal discussions on social platforms (e.g., Stack Overflow) accumulate a large body of programming knowledge in natural language text. Natural language processing (NLP) techniques can be exploited to harvest this knowledge base for software engineering tasks. To make effective use of NLP techniques, a consistent vocabulary is essential. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms in informal discussions, such as abbreviations, synonyms, and misspellings. Existing techniques to deal with such morphological forms are either designed for general English or predominantly rely on domain-specific lexical rules. A thesaurus of software-specific terms and commonly used morphological forms is desirable for normalizing software engineering text, but very difficult to build manually. In this work, we propose an automatic approach to build such a thesaurus. Our approach identifies software-specific terms by contrasting software-specific and general corpora, and infers morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations, and graph analysis of morphological relations. We evaluate the coverage and accuracy of the resulting thesaurus against community-curated lists of software-specific terms, abbreviations, and synonyms. We also manually examine the correctness of the identified abbreviations and synonyms in our thesaurus. We demonstrate the usefulness of our thesaurus in a case study of normalizing questions from Stack Overflow and CodeProject.
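A minimal sketch of the corpus-contrast step for identifying software-specific terms (the frequency-ratio heuristic, the threshold, and the toy corpora are illustrative assumptions, not the paper's exact method):

```python
from collections import Counter

# Treat a word as software-specific when its relative frequency in the
# software corpus is much higher than in a general-English corpus.
def domain_terms(software_tokens, general_tokens, ratio=3.0):
    sw, gen = Counter(software_tokens), Counter(general_tokens)
    n_sw, n_gen = sum(sw.values()), sum(gen.values())
    terms = set()
    for word, count in sw.items():
        p_sw = count / n_sw
        p_gen = (gen[word] + 1) / (n_gen + len(gen))  # add-one smoothing
        if p_sw / p_gen >= ratio:
            terms.add(word)
    return terms

print(domain_terms(["npm", "npm", "install", "the"],
                   ["the", "the", "cat", "sat"]))  # {'npm'}
```

The smoothing term keeps words absent from the general corpus from producing division by zero; the paper then goes further, grouping each identified term with its abbreviations and synonyms via word embeddings, lexical rules, and graph analysis.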
This paper presents a rule-based model of a parts-of-speech (POS) tagset for Classical Tamil Texts (CTT). The noun forms follow a type pattern and the verb forms a token pattern, based on a form-agreement method. This is a very efficient and novel approach because the Tamil language has a built-in system of agreement/concord within the sentence. The Classical Tamil tagset is divided into two basic classifications: noun morphology and verb morphology.
This paper presents a set of linguistically informed and motivated multilingual alignments -- the CLUE4Translation Alignments -- covering several categories of multiwords and phrasal units, which constitute important challenges to high-quality machine translation. The alignments comprise all possible word combinations between English, French, Portuguese, and Spanish parallel texts of the common test set of the Europarl corpus. The gold collection of manually annotated alignments -- the Gold-CLUE4Translation -- consists of 400 sentences aligned according to previously proposed guidelines -- the CLUE4Translation Alignment Guidelines -- for each language pair, resulting in a set of 2,400 alignments. The alignments were performed with the support of a new alignment tool -- CLUE-Aligner -- developed to facilitate the alignment of translation units in the bitexts, including the alignment of non-contiguous multiwords and phrasal translation units. The Gold-CLUE4Translation, the CLUE-Aligner, and the CLUE4Translation Alignment Guidelines are publicly available.
Poster presented at the 2nd meeting of the COST Action CA16105 - enetCollect : European Network for Combining Language Learning with Crowdsourcing Techniques, which took place at Alexandru Ioan Cuza University, in Iasi, Romania.
This poster shows paraphrastic suggestions in the eSPERTo paraphrasing system applied to a QA application on a virtual agent and to a summarization tool. It also shows how paraphrases can be used in language learning and the tests envisaged to make eSPERTo a Portuguese learning tool.
This paper introduces a state-of-the-art machine translation (MT) evaluation survey that covers both manual and automatic evaluation methods. The traditional human evaluation criteria mainly include intelligibility, fidelity, fluency, adequacy, comprehension, and informativeness. The advanced human assessments include task-oriented measures, post-editing, segment ranking, and extended criteria, etc. We classify the automatic evaluation methods into two categories: the lexical similarity scenario and the application of linguistic features. The lexical similarity methods cover edit distance, precision, recall, F-measure, and word order. The linguistic features can be divided into syntactic features and semantic features. The syntactic features include part-of-speech tags, phrase types, and sentence structures, and the semantic features include named entities, synonyms, textual entailment, paraphrase, semantic roles, and language models. Subsequently, we also introduce evaluation methods for MT evaluation itself, including different correlation scores, and the recent quality estimation (QE) tasks for MT.
This paper differs from existing works (Dorr et al., 2009; EuroMatrix, 2007) in several respects: it introduces recent developments in MT evaluation measures, the different classifications from manual to automatic evaluation measures, the recent QE tasks of MT, and a concise construction of the content. For the latest version, please go to: https://arxiv.org/abs/1605.04515
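As a small worked example of the lexical-similarity measures surveyed above (precision, recall, F-measure), sketched over unigrams with clipped counts (the function and example sentences are illustrative):

```python
from collections import Counter

# Unigram precision/recall/F-measure between a hypothesis translation and
# a reference, with clipped (multiset-intersection) match counts.
def prf(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    matched = sum((Counter(hyp) & Counter(ref)).values())  # clipped matches
    precision = matched / len(hyp)
    recall = matched / len(ref)
    f = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f

p, r, f = prf("the cat sat on mat", "the cat sat on the mat")
print(p, r, f)  # 1.0, 5/6, 10/11
```

Clipping (taking the multiset intersection) stops a hypothesis from earning credit for repeating a reference word more often than it actually occurs, the same safeguard that BLEU applies to higher-order n-grams.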
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE) (ijnlc)
Extracting key phrases from documents is a common task in many applications. In general, a Noun
Phrase Extractor consists of three modules: tokenization, part-of-speech tagging, and noun phrase
identification. These are the three main steps in building the new system, ANPE. This paper aims at
extracting Arabic noun phrases from a corpus of documents; the relevant criteria (recall and precision)
are used as evaluation measures. On the one hand, when using NPs rather than single terms, the system
yields more relevant documents among those retrieved; on the other hand, it gives lower precision
because the number of retrieved documents decreases. Finally, the researchers conclude and recommend
improvements for more effective and efficient research in the future.
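A hedged sketch of the three-module pipeline, using an English stand-in and an invented tagset, since ANPE's Arabic tokenizer and tagger are not shown here:

```python
import re

# Noun-phrase identification over already-tagged tokens: a noun phrase is
# matched as optional determiner + adjectives + one or more nouns.
def extract_nps(tagged):
    """tagged: list of (word, tag) pairs; returns noun-phrase strings."""
    tags = " ".join(tag for _, tag in tagged)
    words = [word for word, _ in tagged]
    nps = []
    # DET? ADJ* NOUN+ over the tag string; with this small tagset no tag is
    # a substring of another, so matches only begin at token boundaries.
    for m in re.finditer(r"(?:DET )?(?:ADJ )*NOUN(?: NOUN)*", tags):
        start = tags[:m.start()].count(" ")
        length = m.group(0).count(" ") + 1
        nps.append(" ".join(words[start:start + length]))
    return nps

sent = [("the", "DET"), ("big", "ADJ"), ("house", "NOUN"),
        ("is", "VERB"), ("red", "ADJ")]
print(extract_nps(sent))  # ['the big house']
```

The Arabic system additionally has to handle clitic determiners and richer agreement morphology inside the NP pattern, which is where its tokenization and tagging modules do most of the work.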
(Final) CIDOC 2009 Chinese-language translation of the AAT
1. Chinese-language Art & Architecture Thesaurus: Methods and Issues Shu-Jiun (Sophy) Chen Program Office, Taiwan e-Learning and Digital Archives Program CIDOC 2009, Chile, 29 September 2009 Special Session on Multilingual Access: "Developing Tools for Multilingual Access to Cultural Materials: the Experience of the Art & Architecture Thesaurus (AAT)"
7. Proofread (T3): unify term translations and formatting (T3_U); translation quality control (T3_QC); recruit proofreaders (T3_R). T3_QC2: make sure the translated scope note is both fluent and idiomatic. Frequent problem: the translated text is redundant or awkward, does not assimilate to the cultural context of the target language, and is not understandable in Chinese.
8. Scope Note of a New Concept (N), with Cross-Division Coordination (C): data collection (N1); creating a new record with the required fields (N2), following the Getty and TELDAP editorial guidelines to create the record in Chinese and consulting authoritative references; finding and adding related images (N3). The workflow also involves Equivalence Mapping (M), including Partial Equivalence (M3_PE), Expert Group review (E) with Content Verification (E2), Chinese-to-English Translation (T2_CE), Proofreading (T3), and System Development (S), where the translated content, images, and bibliography are input into the system once the review is approved.
9. Expert Group (E), with Cross-Division Coordination (C): Preparation (E1), including identifying content (E1_I) and compiling and updating review files (E1_F); Content Verification (E2), including revision by experts (E2_RE), scope note accuracy (E2_NA), and translation accuracy (E2_TA). Content experts verify the accuracy of the provided terms: if accurate (a: review approved), the data are input into the system (S); if not (b: rejected), revision is needed before the data can be submitted to AAT and AAT-Taiwan. This step connects to creating the new record with the required fields (N3) and to translation accuracy checks (T3_A).
30. Thank you for your attention! Photographer: unknown. Source: Cyber Island http://cyberisland.ndap.org.tw/game_item.php?cid=43&album=969&parent=11
Editor's Notes
Goals: to showcase Taiwan's cultural and natural diversity; to integrate archive content and technology into industry, education, research, and social development; to establish digital archive and e-learning industries; to deepen the application of e-learning in formal and lifelong education; to secure an international position for digital language teaching; and to internationalize the results of digital archiving and e-learning while building international cooperation networks. To build a life-long knowledge discovery environment and bridge the digital divide. To serve as a leader in information technology and social development. To showcase Taiwan's biological, ecological, cultural, and social diversities.